DataFrame Object

The pandas DataFrame object extends the capabilities of the Series object to two dimensions. Where a Series adds an index to a NumPy array but can associate only a single data item with each index label, a DataFrame integrates multiple Series objects by aligning them along common index labels.

A DataFrame object can be thought of as a dictionary-like container of one or more Series objects, as a spreadsheet, or even as a database table.
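As a sketch of the dictionary-like view, a DataFrame can also be built from a dict of equal-length lists, where the keys become the column names (the names and values here are illustrative):

```python
import pandas as pd

# each key becomes a column; each list supplies that column's values
df = pd.DataFrame({'a': [10, 20], 'b': [11, 21]})
```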

A DataFrame also introduces the concept of multiple axes, specifically a horizontal and a vertical axis. pandas functions can then be applied along either axis, operating horizontally across all the values in a row or vertically down each column.
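A minimal sketch of the two axes, using a small frame of made-up numbers: axis=0 operates down each column, while axis=1 operates across each row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[10, 11], [20, 21]]), columns=['a', 'b'])

# axis=0 sums down each column: a -> 30, b -> 32
col_sums = df.sum(axis=0)
# axis=1 sums across each row: row 0 -> 21, row 1 -> 41
row_sums = df.sum(axis=1)
```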

Creating a DataFrame


In [1]:
# import NumPy and pandas
import numpy as np
import pandas as pd

# set some pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows',10)

In [3]:
# create a DataFrame from a 2-d array
pd.DataFrame(np.array([[10,11],[20,21]]))


Out[3]:
    0   1
0  10  11
1  20  21

In [4]:
# create a DataFrame from a list of Series objects
df1 = pd.DataFrame([pd.Series(np.arange(10,15)),pd.Series(np.arange(15,20))])
df1


Out[4]:
    0   1   2   3   4
0  10  11  12  13  14
1  15  16  17  18  19

In [5]:
# what is the shape of the data frame
df1.shape


Out[5]:
(2, 5)

In [7]:
# specify column names
df = pd.DataFrame(np.array([[10,11],[20,21]]), columns=['a','b'])
df


Out[7]:
    a   b
0  10  11
1  20  21

In [9]:
# what are the names of the columns
df.columns


Out[9]:
Index(['a', 'b'], dtype='object')

In [10]:
# retrieve just the names of the columns by position
"{0},{1}".format(df.columns[0],df.columns[1])


Out[10]:
'a,b'

In [11]:
# rename the columns
df.columns = ['c','d']
df


Out[11]:
    c   d
0  10  11
1  20  21

In [13]:
# create a dataframe with named rows and columns
df = pd.DataFrame(np.array([[10,11],[20,21]]),columns=['col1','col2'],index=['row1','row2'])
df


Out[13]:
      col1  col2
row1    10    11
row2    20    21

In [14]:
df.index


Out[14]:
Index(['row1', 'row2'], dtype='object')

In [15]:
# create a dataframe with 2 series objects
# and a dictionary
s1 = pd.Series(np.arange(1,6,1))
s2 = pd.Series(np.arange(6,11,1))
pd.DataFrame({'c1':s1,'c2':s2})


Out[15]:
   c1  c2
0   1   6
1   2   7
2   3   8
3   4   9
4   5  10

A DataFrame also performs automatic alignment of the data for each Series passed in via a dictionary. For example, the following code adds a third column during DataFrame initialisation.


In [16]:
# demonstrate alignment during creation
s3 = pd.Series(np.arange(12,14),index=[1,2])
df = pd.DataFrame({'c1':s1,'c2':s2,'c3':s3})
df


Out[16]:
   c1  c2    c3
0   1   6   NaN
1   2   7  12.0
2   3   8  13.0
3   4   9   NaN
4   5  10   NaN

In the above example, the first two Series did not have an index specified, so they were indexed with the default labels 0..4. The third Series does specify index values, and its values are therefore placed in the DataFrame rows whose index labels match those of the previous columns.

Example Dataset

The sp500.csv file is a sample of S&P 500 data from Yahoo! Finance.


In [17]:
# show the first 3 lines of the file
!head -n 3 ../../data/sp500.csv





In [28]:
sp500 = pd.read_csv("../../data/sp500.csv",index_col='Symbol',usecols=[0,2,3,7])
# first five rows
sp500.head()


Out[28]:
                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  141.14      26.668
ABT                Health Care   39.60      15.573
ABBV               Health Care   53.95       2.954
ACN     Information Technology   79.79       8.326
ACE                 Financials  102.91      86.897

In [20]:
# last 5 rows
sp500.tail()


Out[20]:
                        Sector   Price  Book Value
Symbol                                            
YHOO    Information Technology   35.02      12.768
YUM     Consumer Discretionary   74.77       5.147
ZMH                Health Care  101.84      37.181
ZION                Financials   28.43      30.191
ZTS                Health Care   30.53       2.150

In [21]:
# how many rows of data
len(sp500)


Out[21]:
500

In [42]:
# examine the index
sp500.index


Out[42]:
Index(['MMM', 'ABT', 'ABBV', 'ACN', 'ACE', 'ACT', 'ADBE', 'AES', 'AET', 'AFL',
       ...
       'XEL', 'XRX', 'XLNX', 'XL', 'XYL', 'YHOO', 'YUM', 'ZMH', 'ZION', 'ZTS'],
      dtype='object', name='Symbol', length=500)

In [23]:
# get the columns
sp500.columns


Out[23]:
Index(['Sector', 'Price', 'Book Value'], dtype='object')

In [24]:
# second dataset
!head -n 3 ../../data/omh.csv


Date,MSFT,AAPL
2014-12-01,48.62,115.07
2014-12-02,48.46,114.63

In [25]:
# read in the data
one_mon_hist = pd.read_csv("../../data/omh.csv")
one_mon_hist[:3]


Out[25]:
         Date   MSFT    AAPL
0  2014-12-01  48.62  115.07
1  2014-12-02  48.46  114.63
2  2014-12-03  48.08  115.93

Selecting Columns of a DataFrame


In [36]:
# examine the column names
sp500.columns


Out[36]:
Index(['Sector', 'Price', 'Book Value'], dtype='object')

In [47]:
type(sp500)


Out[47]:
pandas.core.frame.DataFrame

In [62]:
sp500[sp500.columns[1]].head()


Out[62]:
Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
Name: Price, dtype: float64

In [50]:
df = sp500['Price']
df


Out[50]:
Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64

In [64]:
# create a new dataframe with integers as the columns names
# make sure to use .copy() or change will be in-place
df = sp500.copy()
df.columns = [0,1,2]
df.head()


Out[64]:
                             0       1       2
Symbol                                        
MMM                Industrials  141.14  26.668
ABT                Health Care   39.60  15.573
ABBV               Health Care   53.95   2.954
ACN     Information Technology   79.79   8.326
ACE                 Financials  102.91  86.897

In [66]:
# because the column names are actually integers
# and therefore [1] is found as a column
df[1]


Out[66]:
Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: 1, Length: 500, dtype: float64

In [67]:
df.columns


Out[67]:
Int64Index([0, 1, 2], dtype='int64')

In [68]:
# get price column by name
# result is a series
sp500['Price']


Out[68]:
Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64

In [70]:
# get price and sector columns
# since a list is passed, the result is a DataFrame
sp500[['Price','Sector']]


Out[70]:
         Price                  Sector
Symbol                                
MMM     141.14             Industrials
ABT      39.60             Health Care
ABBV     53.95             Health Care
ACN      79.79  Information Technology
ACE     102.91              Financials
...        ...                     ...
YHOO     35.02  Information Technology
YUM      74.77  Consumer Discretionary
ZMH     101.84             Health Care
ZION     28.43              Financials
ZTS      30.53             Health Care

[500 rows x 2 columns]

In [71]:
# attribute access of the column by name
sp500.Price


Out[71]:
Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64

In [72]:
loc = sp500.columns.get_loc('Price')
loc


Out[72]:
1

Selecting rows and values of a DataFrame using the index


In [73]:
# first five rows
sp500[:5]


Out[73]:
                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  141.14      26.668
ABT                Health Care   39.60      15.573
ABBV               Health Care   53.95       2.954
ACN     Information Technology   79.79       8.326
ACE                 Financials  102.91      86.897

In [74]:
sp500['ABT':'ACN']


Out[74]:
                        Sector  Price  Book Value
Symbol                                           
ABT                Health Care  39.60      15.573
ABBV               Health Care  53.95       2.954
ACN     Information Technology  79.79       8.326

In [75]:
sp500.loc['ACN']


Out[75]:
Sector        Information Technology
Price                          79.79
Book Value                     8.326
Name: ACN, dtype: object

In [78]:
sp500.loc[['MMM','MSFT']]


Out[78]:
                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  141.14      26.668
MSFT    Information Technology   40.12      10.584

In [79]:
# get rows in locations 0 and 2
sp500.iloc[[0,2]]


Out[79]:
             Sector   Price  Book Value
Symbol                                 
MMM     Industrials  141.14      26.668
ABBV    Health Care   53.95       2.954

In [80]:
# get the location of MMM and A in the index
i1 = sp500.index.get_loc('MMM')
i2 = sp500.index.get_loc('A')
"{0},{1}".format(i1,i2)


Out[80]:
'0,10'

In [81]:
# and get the rows
sp500.iloc[[i1,i2]]


Out[81]:
             Sector   Price  Book Value
Symbol                                 
MMM     Industrials  141.14      26.668
A       Health Care   56.18      16.928

Selecting rows by index label and/or location: .ix[]


In [82]:
# by label
sp500.ix[['MSFT','ZTS']]


/Users/cnc/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  
Out[82]:
                        Sector  Price  Book Value
Symbol                                           
MSFT    Information Technology  40.12      10.584
ZTS                Health Care  30.53       2.150

Scalar lookup by label or location using .at[] and .iat[]


In [84]:
# by label in both the index and column
sp500.at['MMM','Price']


Out[84]:
141.13999999999999

In [85]:
# by location. Row 0, column 1
sp500.iat[0,1]


Out[85]:
141.13999999999999

Selecting rows of a DataFrame by Boolean selection


In [87]:
# what rows have a price < 100 ?
sp500.Price < 100


Out[87]:
Symbol
MMM     False
ABT      True
ABBV     True
ACN      True
ACE     False
        ...  
YHOO     True
YUM      True
ZMH     False
ZION     True
ZTS      True
Name: Price, Length: 500, dtype: bool

In [88]:
# get only the Price where price is < 10 and > 0
r = sp500[(sp500.Price < 10) & (sp500.Price > 0)][['Price']]
r


Out[88]:
        Price
Symbol       
FTR      5.81
HCBK     9.80
HBAN     9.10
SLM      8.82
WIN      9.38

Modifying the structure and content of DataFrame


In [89]:
# rename the Book Value column to not have a space
# this returns a copy with the column renamed
df = sp500.rename(columns={'Book Value':'BookValue'})
# print first 2 rows
df[:2]


Out[89]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573

In [90]:
df.columns


Out[90]:
Index(['Sector', 'Price', 'BookValue'], dtype='object')

In [91]:
# old dataframe remains intact
sp500.columns


Out[91]:
Index(['Sector', 'Price', 'Book Value'], dtype='object')

In [93]:
# this changes the column in-place
sp500.rename(columns={'Book Value':'BookValue'},inplace=True)
sp500.columns


Out[93]:
Index(['Sector', 'Price', 'BookValue'], dtype='object')

In [94]:
sp500.BookValue[:5]


Out[94]:
Symbol
MMM     26.668
ABT     15.573
ABBV     2.954
ACN      8.326
ACE     86.897
Name: BookValue, dtype: float64

Adding and Inserting Columns


In [95]:
# make a copy
copy = sp500.copy()
copy['TwicePrice'] = sp500.Price * 2
copy[:2]


Out[95]:
             Sector   Price  BookValue  TwicePrice
Symbol                                            
MMM     Industrials  141.14     26.668      282.28
ABT     Health Care   39.60     15.573       79.20

In [96]:
copy = sp500.copy()
copy.insert(2,'TwicePrice',sp500.Price*2)
copy[:2]


Out[96]:
             Sector   Price  TwicePrice  BookValue
Symbol                                            
MMM     Industrials  141.14      282.28     26.668
ABT     Health Care   39.60       79.20     15.573

It is important to remember that this is not simply inserting a column into the DataFrame. The alignment process used here is performing a left join of the DataFrame and the Series by their index labels and then creating the column and populating data in the appropriate cell in the DataFrame.


In [97]:
# extract the first 3 rows and the Price column
rcopy = sp500[0:3][['Price']].copy()
rcopy


Out[97]:
         Price
Symbol        
MMM     141.14
ABT      39.60
ABBV     53.95

In [100]:
# create a new series to merge as a column
# one label (MMM) exists in rcopy; MSFT does not
s = pd.Series({'MMM':'Is in the DataFrame','MSFT':'Is not in the DataFrame'})
s


Out[100]:
MMM         Is in the DataFrame
MSFT    Is not in the DataFrame
dtype: object

In [101]:
rcopy['Comment'] = s
rcopy


Out[101]:
         Price              Comment
Symbol                             
MMM     141.14  Is in the DataFrame
ABT      39.60                  NaN
ABBV     53.95                  NaN

Replacing the contents of a column


In [102]:
# replace the Price column data with the new values
# instead of adding a new column
copy = sp500.copy()
copy.Price = sp500.Price * 2
copy[:5]


Out[102]:
                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  282.28     26.668
ABT                Health Care   79.20     15.573
ABBV               Health Care  107.90      2.954
ACN     Information Technology  159.58      8.326
ACE                 Financials  205.82     86.897

In [103]:
# copy all 500 rows
copy = sp500.copy()
prices = sp500.iloc[[3,1,0]].Price.copy()
# examine the extracted prices
prices


Out[103]:
Symbol
ACN     79.79
ABT     39.60
MMM    141.14
Name: Price, dtype: float64

In [104]:
# now replace the Price column with prices
copy.Price = prices
# it is not really simple insertion, it is alignment
# values are put in the correct place according to labels
copy


Out[104]:
                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  141.14     26.668
ABT                Health Care   39.60     15.573
ABBV               Health Care     NaN      2.954
ACN     Information Technology   79.79      8.326
ACE                 Financials     NaN     86.897
...                        ...     ...        ...
YHOO    Information Technology     NaN     12.768
YUM     Consumer Discretionary     NaN      5.147
ZMH                Health Care     NaN     37.181
ZION                Financials     NaN     30.191
ZTS                Health Care     NaN      2.150

[500 rows x 3 columns]

Deleting Columns in a DataFrame

Columns can be deleted from a DataFrame by using the del keyword, the pop(column) method of the DataFrame, or by calling the drop() method of the DataFrame.

  • del simply deletes the Series from the DataFrame (in-place)
  • pop() both deletes the Series and returns it as a result (also in-place)
  • drop(labels, axis=1) returns a new DataFrame with the column(s) removed; the original DataFrame object is not modified
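The three techniques can be sketched side by side on a small frame with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

del df['a']                      # in-place: 'a' is gone from df
popped = df.pop('b')             # in-place: 'b' is gone, returned as a Series
after = df.drop(['c'], axis=1)   # returns a new frame; df still has 'c'
```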

Using del


In [105]:
# Example of using del to delete a column
# make a copy of a subset of the data frame
copy = sp500[:2].copy()
copy


Out[105]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573

In [106]:
del copy['BookValue']
copy


Out[106]:
             Sector   Price
Symbol                     
MMM     Industrials  141.14
ABT     Health Care   39.60

Using pop


In [108]:
# Example of using pop to remove a column from a DataFrame
# first make a copy of a subset of the data frame
# pop works in-place
copy = sp500[:2].copy()
popped = copy.pop('Sector')
copy


Out[108]:
         Price  BookValue
Symbol                   
MMM     141.14     26.668
ABT      39.60     15.573

In [110]:
# and we have the Sector column as the result of the pop
popped


Out[110]:
Symbol
MMM    Industrials
ABT    Health Care
Name: Sector, dtype: object

Using drop

The drop() method can be used to remove both rows and columns. To use it to remove a column, specify axis=1:


In [111]:
# Example of using drop to remove a column
# make a copy of a subset of the DataFrame
copy = sp500[:2].copy()
afterdrop = copy.drop(['Sector'],axis=1)
afterdrop


Out[111]:
         Price  BookValue
Symbol                   
MMM     141.14     26.668
ABT      39.60     15.573

Adding rows to a DataFrame

  • Appending a DataFrame to another
  • Concatenation of two DataFrame objects
  • Setting with enlargement

Appending rows with .append()

Note: Appending does not perform alignment and can result in duplicate index values.


In [115]:
# copy the first three rows of sp500
df1 = sp500.iloc[0:3].copy()
# copy the rows at positions 10, 11, and 2 (note the repeat of ABBV)
df2 = sp500.iloc[[10,11,2]]
# append df1 and df2
appended = df1.append(df2)
# the result is the rows of the first followed by those in the second
appended


Out[115]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954
A       Health Care   56.18     16.928
GAS       Utilities   52.98     32.462
ABBV    Health Care   53.95      2.954
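Because appending does not align on the index, duplicate labels such as ABBV above can slip in. A small sketch with made-up data (using pd.concat, which behaves the same way by default) shows how index.duplicated() flags them:

```python
import pandas as pd

df1 = pd.DataFrame({'Price': [141.14, 53.95]}, index=['MMM', 'ABBV'])
df2 = pd.DataFrame({'Price': [53.95]}, index=['ABBV'])
appended = pd.concat([df1, df2])

# True marks each label already seen earlier in the index
dup_mask = appended.index.duplicated()
```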

The sets of columns of the DataFrame objects being appended do not need to be the same. The resulting DataFrame will consist of the union of the columns in both, and where either did not have a column, NaN will be used as the value.


In [117]:
# DataFrame using df1.index and a single PER column
# also a good example of using a scalar value
# to initialize multiple rows.
df3 = pd.DataFrame(0.0,index=df1.index,columns=['PER'])
df3


Out[117]:
        PER
Symbol     
MMM     0.0
ABT     0.0
ABBV    0.0

In [118]:
# append df1 and df3
# each has three rows, so 6 rows is the result
# df1 had no PER column, so NaN for those rows
# df3 had no BookValue, Price or Sector, so NaN values
df1.append(df3)


Out[118]:
        BookValue  PER   Price       Sector
Symbol                                     
MMM        26.668  NaN  141.14  Industrials
ABT        15.573  NaN   39.60  Health Care
ABBV        2.954  NaN   53.95  Health Care
MMM           NaN  0.0     NaN          NaN
ABT           NaN  0.0     NaN          NaN
ABBV          NaN  0.0     NaN          NaN

To append without preserving the index labels from either DataFrame, pass ignore_index=True, which creates a default integer index instead.


In [120]:
# ignore index labels, create default index
df1.append(df3,ignore_index=True)


Out[120]:
   BookValue  PER   Price       Sector
0     26.668  NaN  141.14  Industrials
1     15.573  NaN   39.60  Health Care
2      2.954  NaN   53.95  Health Care
3        NaN  0.0     NaN          NaN
4        NaN  0.0     NaN          NaN
5        NaN  0.0     NaN          NaN

Concatenating DataFrame objects with pd.concat()

pd.concat() works similarly to the .append() method, but it also adds the ability to specify an axis, so concatenation can be row-based or column-based.

The default is row-based, like the .append() method.


In [121]:
# copy the first 3 rows of sp500
df1 = sp500.iloc[0:3].copy()
# copy the rows at positions 10, 11, and 2
df2 = sp500.iloc[[10,11,2]]
## pass them as a list
pd.concat([df1,df2])


Out[121]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954
A       Health Care   56.18     16.928
GAS       Utilities   52.98     32.462
ABBV    Health Care   53.95      2.954

Under the hood, pandas calculates the sorted union of the distinct column names across all supplied objects, uses those as the columns, and then appends the data along the rows of each object in the order given in the list.


In [122]:
# copy df2
df2_2 = df2.copy()
# add column to df2_2 that is not in df1
df2_2.insert(3,'Foo',pd.Series(0,index=df2.index))
df2_2


Out[122]:
             Sector  Price  BookValue  Foo
Symbol                                    
A       Health Care  56.18     16.928    0
GAS       Utilities  52.98     32.462    0
ABBV    Health Care  53.95      2.954    0

In [123]:
# now concatenate
pd.concat([df1,df2_2])


Out[123]:
        BookValue  Foo   Price       Sector
Symbol                                     
MMM        26.668  NaN  141.14  Industrials
ABT        15.573  NaN   39.60  Health Care
ABBV        2.954  NaN   53.95  Health Care
A          16.928  0.0   56.18  Health Care
GAS        32.462  0.0   52.98    Utilities
ABBV        2.954  0.0   53.95  Health Care

Using the keys parameter, it is possible to differentiate the pandas objects from which the rows originated. The following code adds a level to the index which represents the source object.


In [124]:
# specify keys
r = pd.concat([df1,df2_2],keys=['df1','df2'])
r


Out[124]:
            BookValue  Foo   Price       Sector
    Symbol                                     
df1 MMM        26.668  NaN  141.14  Industrials
    ABT        15.573  NaN   39.60  Health Care
    ABBV        2.954  NaN   53.95  Health Care
df2 A          16.928  0.0   56.18  Health Care
    GAS        32.462  0.0   52.98    Utilities
    ABBV        2.954  0.0   53.95  Health Care
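With keys specified, the outer level of the resulting MultiIndex records which object each row came from, so a source frame can be recovered with .loc. A minimal sketch with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'Price': [141.14]}, index=['MMM'])
df2 = pd.DataFrame({'Price': [56.18]}, index=['A'])
r = pd.concat([df1, df2], keys=['df1', 'df2'])

# selecting on the outer level returns just the rows that came from df2
from_df2 = r.loc['df2']
```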

We can change the axis of the concatenation to work along the columns by specifying axis = 1, which will calculate the sorted union of the distinct index labels from the rows and then append columns and their data from the specified objects.


In [131]:
# first three rows, Price and Sector columns
# (position-based selection such as sp500[:3][[0,1]] raises an error
#  because the columns are labeled)
df3 = sp500[:3][['Price','Sector']]
df3


Out[131]:
         Price       Sector
Symbol                     
MMM     141.14  Industrials
ABT      39.60  Health Care
ABBV     53.95  Health Care

In [132]:
df4 = sp500[:3][['BookValue']]
df4


Out[132]:
        BookValue
Symbol           
MMM        26.668
ABT        15.573
ABBV        2.954

In [133]:
# put them back together
pd.concat([df3,df4],axis=1)


Out[133]:
         Price       Sector  BookValue
Symbol                                
MMM     141.14  Industrials     26.668
ABT      39.60  Health Care     15.573
ABBV     53.95  Health Care      2.954

We can examine this operation further by adding a column to the second DataFrame whose name duplicates a column in the first. The result will have duplicate columns, as they are blindly appended.


In [134]:
# make a copy of df4
df4_2 = df4.copy()
# add a column to df4_2, that is also in df3
df4_2.insert(1,'Sector',pd.Series(1,index=df4_2.index))
df4_2


Out[134]:
        BookValue  Sector
Symbol                   
MMM        26.668       1
ABT        15.573       1
ABBV        2.954       1

In [135]:
# demonstrate duplicate columns
pd.concat([df3,df4_2],axis=1)


Out[135]:
         Price       Sector  BookValue  Sector
Symbol                                        
MMM     141.14  Industrials     26.668       1
ABT      39.60  Health Care     15.573       1
ABBV     53.95  Health Care      2.954       1

pandas is performing an outer join along the labels of the specified axis. An inner join can be specified using the join='inner' parameter.


In [137]:
# first three rows and first two columns
df5 = sp500[:3][['Sector','Price']]
df5


Out[137]:
             Sector   Price
Symbol                     
MMM     Industrials  141.14
ABT     Health Care   39.60
ABBV    Health Care   53.95

In [138]:
# rows 2 through 4 and the first two columns
df6 = sp500[2:5][['Sector','Price']]
df6


Out[138]:
                        Sector   Price
Symbol                                
ABBV               Health Care   53.95
ACN     Information Technology   79.79
ACE                 Financials  102.91

In [139]:
# an inner join on the index labels results in only one row
pd.concat([df5,df6],join='inner',axis=1)


Out[139]:
             Sector  Price       Sector  Price
Symbol                                        
ABBV    Health Care  53.95  Health Care  53.95

Adding rows (and columns) via setting with enlargement

Rows can also be added to a DataFrame through the .loc property. This technique is referred to as setting with enlargement.


In [140]:
# get a small subset of sp500
# make sure to copy the slice so we work on a copy
ss = sp500[:3].copy()
# create a new row with index label FOO
# and assign some values to the columns via a list
ss.loc['FOO'] = ['the sector',100,110]
ss


Out[140]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954
FOO      the sector  100.00    110.000

Note that the change is made in-place. If FOO had already existed as an index label, the data in that row would have been replaced instead. This is one means of updating a DataFrame in-place, as .loc not only retrieves row(s) but also lets you modify the results that are returned.

It is also possible to add columns in this manner.


In [141]:
# copy of the subset/ slice
ss = sp500[:3].copy()
# add the new column initialized to 0
ss.loc[:,'PER'] = 0
ss


Out[141]:
             Sector   Price  BookValue  PER
Symbol                                     
MMM     Industrials  141.14     26.668    0
ABT     Health Care   39.60     15.573    0
ABBV    Health Care   53.95      2.954    0

Removing rows from a DataFrame

Removing rows from a DataFrame object is normally performed using one of three techniques:

  • Using the drop() method
  • Boolean selection
  • Selection using a slice.

Technically, all three techniques return a new object by default; .drop() removes rows in-place on the source object only when passed inplace=True.
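The three techniques can be sketched side by side on a small made-up frame (the labels here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Price': [10.0, 20.0, 30.0]}, index=['x', 'y', 'z'])

dropped = df.drop(['y'])      # new frame without the 'y' row
filtered = df[df.Price < 25]  # Boolean selection keeps x and y
sliced = df[:2]               # slice keeps the first two rows
```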

Removing rows using .drop()

To remove rows from a DataFrame by the index label, you can use the .drop() method of the DataFrame.


In [142]:
# get a copy of the first 5 rows of sp500
ss = sp500[:5].copy()
ss


Out[142]:
                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  141.14     26.668
ABT                Health Care   39.60     15.573
ABBV               Health Care   53.95      2.954
ACN     Information Technology   79.79      8.326
ACE                 Financials  102.91     86.897

In [143]:
# drop rows with labels ABT and ACN
afterdrop = ss.drop(['ABT','ACN'])
afterdrop


Out[143]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABBV    Health Care   53.95      2.954
ACE      Financials  102.91     86.897

In [144]:
# note that ss is not modified
ss


Out[144]:
                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  141.14     26.668
ABT                Health Care   39.60     15.573
ABBV               Health Care   53.95      2.954
ACN     Information Technology   79.79      8.326
ACE                 Financials  102.91     86.897

Removing rows using Boolean selection

Boolean selection can be used to remove rows from a DataFrame by creating a new DataFrame without the desired rows.


In [145]:
# determine the rows where Price > 300
selection = sp500.Price > 300
# to make the output shorter, report the # of rows returned (500),
# and the count of those where Price > 300 (which is 10)
"{0} {1}".format(len(selection),selection.sum())


Out[145]:
'500 10'

In [146]:
# select the complement
withPriceLessThan300 = sp500[~selection]
withPriceLessThan300


Out[146]:
                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  141.14     26.668
ABT                Health Care   39.60     15.573
ABBV               Health Care   53.95      2.954
ACN     Information Technology   79.79      8.326
ACE                 Financials  102.91     86.897
...                        ...     ...        ...
YHOO    Information Technology   35.02     12.768
YUM     Consumer Discretionary   74.77      5.147
ZMH                Health Care  101.84     37.181
ZION                Financials   28.43     30.191
ZTS                Health Care   30.53      2.150

[490 rows x 3 columns]

Removing rows using a slice

Slicing is also often used to remove records from a DataFrame. It is a process similar to Boolean selection, in which we retain all of the rows except the ones we want deleted.


In [147]:
# get only the first 3 rows
onlyfirstthree = sp500[:3]
onlyfirstthree


Out[147]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954

In [148]:
# first three but a copy of them
# get only the first 3 rows
onlyfirstthree = sp500[:3].copy()
onlyfirstthree


Out[148]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954

Changing scalar values in a DataFrame

Scalar values in a DataFrame can be changed by assigning the new value to the result of a value lookup using the .loc, .iloc, or (deprecated) .ix attributes.


In [149]:
# get a subset / copy of the data
subset = sp500[:3].copy()
subset


Out[149]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954

In [150]:
# change scalar by label on row and column
subset.ix['MMM','Price'] = 0
subset


/Users/cnc/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  
Out[150]:
             Sector  Price  BookValue
Symbol                               
MMM     Industrials   0.00     26.668
ABT     Health Care  39.60     15.573
ABBV    Health Care  53.95      2.954
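Given the deprecation warning above, the same scalar update is better written with .loc, or with the scalar-optimized .at. A minimal sketch on a made-up subset:

```python
import pandas as pd

subset = pd.DataFrame({'Price': [141.14, 39.60]}, index=['MMM', 'ABT'])

# .loc and .at both take a row label and a column label
subset.loc['MMM', 'Price'] = 0
subset.at['ABT', 'Price'] = 1.0
```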

.loc may suffer from lower performance as compared to .iloc due to the possibility of needing to map the label values into locations.


In [152]:
# subset of first three rows
subset = sp500[:3].copy()
# get the location of the Price column
price_loc = sp500.columns.get_loc('Price')
# get the location of the ABT row
abt_row_loc = sp500.index.get_loc('ABT')
# change the price
subset.iloc[abt_row_loc,price_loc] = 1000
subset


Out[152]:
             Sector    Price  BookValue
Symbol                                 
MMM     Industrials   141.14     26.668
ABT     Health Care  1000.00     15.573
ABBV    Health Care    53.95      2.954

Arithmetic on a DataFrame


In [153]:
# set the seed to allow reproducible results
np.random.seed(123456)
# create the data frame
df = pd.DataFrame(np.random.randn(5,4), columns=['A','B','C','D'])
df


Out[153]:
          A         B         C         D
0  0.469112 -0.282863 -1.509059 -1.135632
1  1.212112 -0.173215  0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929  1.071804
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401

In [154]:
# multiply everything by 2
df * 2


Out[154]:
          A         B         C         D
0  0.938225 -0.565727 -3.018117 -2.271265
1  2.424224 -0.346429  0.238417 -2.088472
2 -1.723698 -4.209138 -0.989859  2.143608
3  1.443110 -1.413542 -2.079150  0.543720
4 -0.849945  1.134041  0.552464 -2.174801

In [155]:
# get first row
s = df.iloc[0]
# subtract first row from every row of the dataframe
diff = df - s
diff


Out[155]:
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1  0.743000  0.109649  1.628267  0.091396
2 -1.330961 -1.821706  1.014129  2.207436
3  0.252443 -0.423908  0.469484  1.407492
4 -0.894085  0.849884  1.785291  0.048232

When performing an operation between a DataFrame and a Series, pandas aligns the Series index along the DataFrame columns, performing what is referred to as a row-wise broadcast.


In [157]:
# subtract dataframe from series
diff = s - df
diff


Out[157]:
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -0.743000 -0.109649 -1.628267 -0.091396
2  1.330961  1.821706 -1.014129 -2.207436
3 -0.252443  0.423908 -0.469484 -1.407492
4  0.894085 -0.849884 -1.785291 -0.048232

The set of columns returned will be the union of the labels in the index of the Series and the column index of the DataFrame object. If a label representing a result column is not found in either the Series or the DataFrame object, the values will be NaN filled.

The following code demonstrates this by creating a Series with an index representing a subset of the columns in the DataFrame, but also with an additional label.


In [159]:
# B, C
s2 = s[1:3]
# add E
s2['E'] = 0
# see how alignment is applied in math
df + s2


Out[159]:
    A         B         C   D   E
0 NaN -0.565727 -3.018117 NaN NaN
1 NaN -0.456078 -1.389850 NaN NaN
2 NaN -2.387433 -2.003988 NaN NaN
3 NaN -0.989634 -2.548633 NaN NaN
4 NaN  0.284157 -1.232826 NaN NaN

In [160]:
# get rows 1 through 3 and only the B and C columns
subframe = df[1:4][['B','C']]
# we have extracted a little square in the middle of the df
subframe


Out[160]:
          B         C
1 -0.173215  0.119209
2 -2.104569 -0.494929
3 -0.706771 -1.039575

In [161]:
# demonstrate the alignment of the subtraction
df - subframe


Out[161]:
    A    B    C   D
0 NaN  NaN  NaN NaN
1 NaN  0.0  0.0 NaN
2 NaN  0.0  0.0 NaN
3 NaN  0.0  0.0 NaN
4 NaN  NaN  NaN NaN

In [162]:
# get the A column
a_col = df['A']
df.sub(a_col, axis=0)


Out[162]:
     A         B         C         D
0  0.0 -0.751976 -1.978171 -1.604745
1  0.0 -1.385327 -1.092903 -2.256348
2  0.0 -1.242720  0.366920  1.933653
3  0.0 -1.428326 -1.761130 -0.449695
4  0.0  0.991993  0.701204 -0.662428
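When the NaN filling produced by alignment is not desired, the arithmetic methods such as .add() and .sub() accept a fill_value parameter that substitutes a value for a missing entry on either side before the operation. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
other = pd.DataFrame({'B': [10, 10], 'C': [5, 5]})

# plain + leaves non-overlapping labels (A and C) as NaN
plain = df + other

# fill_value=0 treats a value missing on one side as 0,
# so only labels absent from both objects remain NaN
filled = df.add(other, fill_value=0)
```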

Resetting and Reindexing

A DataFrame can have its index reset by using the .reset_index() method. A common use of this is to move the contents of a DataFrame object's index into one or more columns.


In [163]:
# reset the index, moving it into a column
reset_sp500 = sp500.reset_index()
reset_sp500


Out[163]:
    Symbol                  Sector   Price  BookValue
0      MMM             Industrials  141.14     26.668
1      ABT             Health Care   39.60     15.573
2     ABBV             Health Care   53.95      2.954
3      ACN  Information Technology   79.79      8.326
4      ACE              Financials  102.91     86.897
..     ...                     ...     ...        ...
495   YHOO  Information Technology   35.02     12.768
496    YUM  Consumer Discretionary   74.77      5.147
497    ZMH             Health Care  101.84     37.181
498   ZION              Financials   28.43     30.191
499    ZTS             Health Care   30.53      2.150

[500 rows x 4 columns]

One or more columns can also be moved into the index. We can use the .set_index() method for this.


In [164]:
reset_sp500.set_index('Symbol')


Out[164]:
                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  141.14     26.668
ABT                Health Care   39.60     15.573
ABBV               Health Care   53.95      2.954
ACN     Information Technology   79.79      8.326
ACE                 Financials  102.91     86.897
...                        ...     ...        ...
YHOO    Information Technology   35.02     12.768
YUM     Consumer Discretionary   74.77      5.147
ZMH                Health Care  101.84     37.181
ZION                Financials   28.43     30.191
ZTS                Health Care   30.53      2.150

[500 rows x 3 columns]

In [165]:
# get first four rows
subset = sp500[:4].copy()
subset


Out[165]:
                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  141.14     26.668
ABT                Health Care   39.60     15.573
ABBV               Health Care   53.95      2.954
ACN     Information Technology   79.79      8.326

In [166]:
# reindex to have MMM,ABBV and FOO index labels
reindexed = subset.reindex(index=['MMM','ABBV','FOO'])
# note that ABT and ACN are dropped and FOO has NaN values
reindexed


Out[166]:
             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABBV    Health Care   53.95      2.954
FOO             NaN     NaN        NaN

Reindexing can also be done upon the columns.


In [169]:
# reindex columns
subset.reindex(columns=['Price','BookValue','New_Column'])


Out[169]:
         Price  BookValue  New_Column
Symbol                               
MMM     141.14     26.668         NaN
ABT      39.60     15.573         NaN
ABBV     53.95      2.954         NaN
ACN      79.79      8.326         NaN
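The NaN values introduced for new labels can also be avoided with .reindex()'s fill_value parameter. A minimal sketch, using hypothetical values in place of the sp500 data:

```python
import pandas as pd

df = pd.DataFrame({'Price': [141.14, 39.60]}, index=['MMM', 'ABT'])

# fill_value supplies a default for labels introduced by the reindex
r = df.reindex(index=['MMM', 'ABT', 'FOO'], fill_value=0.0)
```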

Hierarchical Indexing

Hierarchical indexing is a feature of pandas that allows two or more index levels to be specified on an axis. A pandas index with multiple levels of hierarchy is referred to as a MultiIndex.


In [170]:
# first push symbol into a column
reindexed = sp500.reset_index()
# and now index sp500 by sector and symbol
multi_fi = reindexed.set_index(['Sector','Symbol'])
multi_fi


Out[170]:
                                Price  BookValue
Sector                 Symbol                   
Industrials            MMM     141.14     26.668
Health Care            ABT      39.60     15.573
                       ABBV     53.95      2.954
Information Technology ACN      79.79      8.326
Financials             ACE     102.91     86.897
...                               ...        ...
Information Technology YHOO     35.02     12.768
Consumer Discretionary YUM      74.77      5.147
Health Care            ZMH     101.84     37.181
Financials             ZION     28.43     30.191
Health Care            ZTS      30.53      2.150

[500 rows x 2 columns]
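Note that a MultiIndex built this way is not necessarily sorted; pandas can slice a sorted (lexsorted) MultiIndex much more efficiently, and .sort_index() puts the levels in order. A small sketch with a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'Price': [141.14, 39.60, 53.95]},
                  index=pd.MultiIndex.from_tuples(
                      [('Industrials', 'MMM'),
                       ('Health Care', 'ABT'),
                       ('Health Care', 'ABBV')],
                      names=['Sector', 'Symbol']))

# sort the index lexicographically across both levels
sorted_df = df.sort_index()
```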

In [171]:
# the index is a multiindex
# examine the index
type(multi_fi.index)


Out[171]:
pandas.core.indexes.multi.MultiIndex

In [175]:
# examine the index
multi_fi.index


Out[175]:
MultiIndex(levels=[['Consumer Discretionary', 'Consumer Discretionary ', 'Consumer Staples', 'Consumer Staples ', 'Energy', 'Financials', 'Health Care', 'Industrials', 'Industries', 'Information Technology', 'Materials', 'Telecommunications Services', 'Utilities'], ['A', 'AA', 'AAPL', 'ABBV', 'ABC', 'ABT', 'ACE', 'ACN', 'ACT', 'ADBE', 'ADI', 'ADM', 'ADP', 'ADS', 'ADSK', 'ADT', 'AEE', 'AEP', 'AES', 'AET', 'AFL', 'AGN', 'AIG', 'AIV', 'AIZ', 'AKAM', 'ALL', 'ALLE', 'ALTR', 'ALXN', 'AMAT', 'AME', 'AMGN', 'AMP', 'AMT', 'AMZN', 'AN', 'AON', 'APA', 'APC', 'APD', 'APH', 'ARG', 'ATI', 'AVB', 'AVP', 'AVY', 'AXP', 'AZO', 'BA', 'BAC', 'BAX', 'BBBY', 'BBT', 'BBY', 'BCR', 'BDX', 'BEAM', 'BEN', 'BF-B', 'BHI', 'BIIB', 'BK', 'BLK', 'BLL', 'BMS', 'BMY', 'BRCM', 'BRK-B', 'BSX', 'BTU', 'BWA', 'BXP', 'C', 'CA', 'CAG', 'CAH', 'CAM', 'CAT', 'CB', 'CBG', 'CBS', 'CCE', 'CCI', 'CCL', 'CELG', 'CERN', 'CF', 'CFN', 'CHK', 'CHRW', 'CI', 'CINF', 'CL', 'CLF', 'CLX', 'CMA', 'CMCSA', 'CME', 'CMG', 'CMI', 'CMS', 'CNP', 'CNX', 'COF', 'COG', 'COH', 'COL', 'COP', 'COST', 'COV', 'CPB', 'CRM', 'CSC', 'CSCO', 'CSX', 'CTAS', 'CTL', 'CTSH', 'CTXS', 'CVC', 'CVS', 'CVX', 'D', 'DAL', 'DD', 'DE', 'DFS', 'DG', 'DGX', 'DHI', 'DHR', 'DIS', 'DISCA', 'DLPH', 'DLTR', 'DNB', 'DNR', 'DO', 'DOV', 'DOW', 'DPS', 'DRI', 'DTE', 'DTV', 'DUK', 'DVA', 'DVN', 'EA', 'EBAY', 'ECL', 'ED', 'EFX', 'EIX', 'EL', 'EMC', 'EMN', 'EMR', 'EOG', 'EQR', 'EQT', 'ESRX', 'ESV', 'ETFC', 'ETN', 'ETR', 'EW', 'EXC', 'EXPD', 'EXPE', 'F', 'FAST', 'FB', 'FCX', 'FDO', 'FDX', 'FE', 'FFIV', 'FIS', 'FISV', 'FITB', 'FLIR', 'FLR', 'FLS', 'FMC', 'FOSL', 'FOXA', 'FRX', 'FSLR', 'FTI', 'FTR', 'GAS', 'GCI', 'GD', 'GE', 'GGP', 'GHC', 'GILD', 'GIS', 'GLW', 'GM', 'GMCR', 'GME', 'GNW', 'GOOG', 'GPC', 'GPS', 'GRMN', 'GS', 'GT', 'GWW', 'HAL', 'HAR', 'HAS', 'HBAN', 'HCBK', 'HCN', 'HCP', 'HD', 'HES', 'HIG', 'HOG', 'HON', 'HOT', 'HP', 'HPQ', 'HRB', 'HRL', 'HRS', 'HSP', 'HST', 'HSY', 'HUM', 'IBM', 'ICE', 'IFF', 'IGT', 'INTC', 'INTU', 'IP', 'IPG', 'IR', 'IRM', 'ISRG', 
'ITW', 'IVZ', 'JBL', 'JCI', 'JEC', 'JNJ', 'JNPR', 'JOY', 'JPM', 'JWN', 'K', 'KEY', 'KIM', 'KLAC', 'KMB', 'KMI', 'KMX', 'KO', 'KORS', 'KR', 'KRFT', 'KSS', 'KSU', 'L', 'LB', 'LEG', 'LEN', 'LH', 'LLL', 'LLTC', 'LLY', 'LM', 'LMT', 'LNC', 'LO', 'LOW', 'LRCX', 'LSI', 'LUK', 'LUV', 'LYB', 'M', 'MA', 'MAC', 'MAR', 'MAS', 'MAT', 'MCD', 'MCHP', 'MCK', 'MCO', 'MDLZ', 'MDT', 'MET', 'MHFI', 'MHK', 'MJN', 'MKC', 'MMC', 'MMM', 'MNST', 'MO', 'MON', 'MOS', 'MPC', 'MRK', 'MRO', 'MS', 'MSFT', 'MSI', 'MTB', 'MU', 'MUR', 'MWV', 'MYL', 'NBL', 'NBR', 'NDAQ', 'NE', 'NEE', 'NEM', 'NFLX', 'NFX', 'NI', 'NKE', 'NLSN', 'NOC', 'NOV', 'NRG', 'NSC', 'NTAP', 'NTRS', 'NU', 'NUE', 'NVDA', 'NWL', 'NWSA', 'OI', 'OKE', 'OMC', 'ORCL', 'ORLY', 'OXY', 'PAYX', 'PBCT', 'PBI', 'PCAR', 'PCG', 'PCL', 'PCLN', 'PCP', 'PDCO', 'PEG', 'PEP', 'PETM', 'PFE', 'PFG', 'PG', 'PGR', 'PH', 'PHM', 'PKI', 'PLD', 'PLL', 'PM', 'PNC', 'PNR', 'PNW', 'POM', 'PPG', 'PPL', 'PRGO', 'PRU', 'PSA', 'PSX', 'PVH', 'PWR', 'PX', 'PXD', 'QCOM', 'QEP', 'R', 'RAI', 'RDC', 'REGN', 'RF', 'RHI', 'RHT', 'RIG', 'RL', 'ROK', 'ROP', 'ROST', 'RRC', 'RSG', 'RTN', 'SBUX', 'SCG', 'SCHW', 'SE', 'SEE', 'SHW', 'SIAL', 'SJM', 'SLB', 'SLM', 'SNA', 'SNDK', 'SNI', 'SO', 'SPG', 'SPLS', 'SRCL', 'SRE', 'STI', 'STJ', 'STT', 'STX', 'STZ', 'SWK', 'SWN', 'SWY', 'SYK', 'SYMC', 'SYY', 'T', 'TAP', 'TDC', 'TE', 'TEG', 'TEL', 'TGT', 'THC', 'TIF', 'TJX', 'TMK', 'TMO', 'TRIP', 'TROW', 'TRV', 'TSCO', 'TSN', 'TSO', 'TSS', 'TWC', 'TWX', 'TXN', 'TXT', 'TYC', 'UNH', 'UNM', 'UNP', 'UPS', 'URBN', 'USB', 'UTX', 'V', 'VAR', 'VFC', 'VIAB', 'VLO', 'VMC', 'VNO', 'VRSN', 'VRTX', 'VTR', 'VZ', 'WAG', 'WAT', 'WDC', 'WEC', 'WFC', 'WFM', 'WHR', 'WIN', 'WLP', 'WM', 'WMB', 'WMT', 'WU', 'WY', 'WYN', 'WYNN', 'X', 'XEL', 'XL', 'XLNX', 'XOM', 'XRAY', 'XRX', 'XYL', 'YHOO', 'YUM', 'ZION', 'ZMH', 'ZTS']],
           labels=[[7, 6, 6, 9, 5, 6, 9, 12, 6, 5, 6, 12, 10, 10, 9, 10, 6, 10, 7, 6, 9, 5, 9, 2, 0, 12, 12, 5, 5, 5, 5, 6, 9, 6, 7, 4, 9, 5, 4, 5, 9, 9, 2, 5, 11, 9, 9, 0, 0, 5, 7, 2, 4, 10, 5, 6, 6, 5, 0, 6, 0, 10, 5, 0, 6, 5, 0, 7, 0, 5, 6, 6, 9, 2, 9, 0, 4, 4, 2, 5, 6, 6, 0, 0, 7, 5, 0, 6, 12, 11, 6, 10, 7, 4, 4, 0, 5, 6, 5, 7, 9, 5, 9, 10, 2, 5, 12, 0, 2, 2, 9, 2, 0, 5, 9, 2, 4, 4, 12, 2, 7, 2, 6, 11, 7, 7, 2, 7, 0, 6, 7, 0, 7, 4, 6, 4, 4, 0, 5, 0, 0, 0, 12, 7, 10, 2, 12, 12, 7, 5, 10, 10, 7, 9, 10, 12, 6, 9, 9, 7, 4, 12, 4, 12, 5, 5, 2, 12, 0, 7, 6, 4, 9, 9, 0, 7, 7, 9, 5, 7, 12, 9, 9, 7, 7, 10, 4, 0, 6, 0, 5, 10, 11, 0, 0, 0, 0, 7, 7, 5, 2, 0, 0, 5, 6, 5, 0, 9, 0, 7, 4, 0, 0, 9, 5, 0, 5, 5, 4, 2, 4, 9, 0, 7, 2, 0, 6, 5, 5, 6, 5, 7, 7, 12, 9, 5, 0, 9, 10, 0, 10, 9, 6, 5, 7, 9, 7, 6, 0, 7, 5, 9, 7, 2, 3, 5, 2, 5, 4, 9, 0, 2, 2, 0, 7, 6, 9, 5, 0, 0, 5, 6, 5, 9, 7, 5, 2, 0, 9, 10, 5, 5, 0, 4, 4, 0, 5, 7, 9, 0, 2, 0, 5, 6, 2, 10, 6, 6, 5, 0, 9, 9, 9, 0, 2, 2, 10, 2, 5, 5, 10, 9, 4, 6, 4, 5, 4, 9, 9, 0, 4, 10, 0, 12, 8, 0, 12, 4, 4, 0, 7, 12, 5, 7, 12, 10, 9, 0, 4, 0, 12, 9, 10, 7, 7, 7, 6, 9, 4, 7, 5, 12, 2, 6, 6, 0, 6, 12, 2, 4, 12, 4, 7, 5, 5, 10, 12, 10, 7, 7, 5, 2, 5, 5, 5, 12, 5, 0, 0, 12, 9, 7, 6, 0, 4, 7, 9, 6, 5, 7, 2, 7, 7, 7, 7, 0, 4, 7, 2, 9, 9, 12, 4, 5, 0, 9, 10, 12, 0, 10, 5, 5, 2, 0, 12, 7, 4, 4, 6, 0, 0, 0, 0, 5, 7, 6, 5, 9, 2, 5, 0, 9, 12, 6, 9, 4, 9, 7, 7, 5, 4, 6, 0, 0, 0, 0, 5, 9, 1, 4, 5, 0, 0, 7, 2, 7, 7, 10, 7, 6, 5, 0, 5, 4, 6, 5, 9, 11, 6, 0, 0, 9, 5, 10, 2, 2, 0, 7, 6, 6, 5, 9, 9, 5, 0, 2, 11, 12, 0, 0, 12, 9, 9, 5, 7, 9, 0, 6, 5, 6], [303, 5, 3, 7, 6, 8, 9, 18, 19, 20, 0, 191, 40, 42, 25, 1, 29, 43, 27, 21, 13, 26, 28, 305, 35, 16, 17, 47, 22, 34, 33, 4, 31, 32, 41, 39, 10, 37, 38, 23, 2, 30, 11, 24, 429, 14, 12, 36, 48, 44, 46, 45, 60, 64, 50, 55, 51, 53, 57, 56, 52, 65, 68, 54, 61, 63, 226, 49, 71, 72, 69, 66, 67, 59, 74, 120, 105, 77, 111, 104, 76, 88, 260, 84, 78, 80, 81, 85, 102, 117, 86, 87, 90, 89, 122, 99, 79, 91, 92, 116, 114, 
73, 119, 94, 95, 98, 101, 106, 261, 82, 118, 93, 97, 96, 113, 75, 108, 103, 151, 422, 199, 109, 110, 83, 115, 100, 121, 131, 142, 146, 126, 134, 124, 137, 492, 147, 138, 144, 127, 133, 128, 135, 123, 139, 140, 141, 143, 145, 136, 163, 125, 156, 164, 149, 150, 153, 166, 148, 155, 157, 162, 165, 158, 160, 152, 159, 154, 167, 169, 168, 161, 491, 177, 172, 174, 171, 175, 178, 180, 188, 176, 179, 181, 183, 182, 184, 189, 170, 187, 185, 58, 173, 190, 202, 192, 206, 207, 193, 194, 195, 198, 200, 205, 203, 197, 208, 209, 204, 196, 210, 211, 221, 212, 228, 220, 213, 217, 216, 224, 231, 219, 225, 218, 222, 227, 130, 229, 230, 215, 232, 214, 244, 241, 433, 237, 234, 240, 233, 235, 236, 239, 238, 243, 245, 242, 246, 248, 249, 247, 251, 252, 250, 266, 254, 201, 255, 258, 256, 259, 257, 265, 264, 263, 268, 272, 271, 280, 275, 269, 270, 282, 274, 277, 273, 276, 267, 278, 279, 281, 284, 314, 287, 285, 310, 308, 288, 302, 289, 286, 290, 301, 291, 298, 293, 300, 317, 296, 309, 297, 262, 292, 315, 312, 299, 430, 295, 306, 304, 294, 311, 307, 313, 316, 318, 320, 321, 331, 334, 325, 339, 326, 324, 340, 323, 329, 328, 327, 322, 319, 253, 333, 336, 335, 330, 332, 337, 338, 345, 346, 343, 342, 344, 341, 350, 367, 363, 355, 347, 70, 370, 348, 372, 357, 365, 375, 358, 359, 351, 368, 378, 371, 382, 349, 352, 369, 373, 374, 381, 354, 353, 360, 361, 362, 366, 376, 356, 377, 364, 379, 384, 383, 380, 129, 393, 397, 399, 391, 388, 389, 398, 386, 390, 394, 107, 395, 396, 387, 385, 425, 112, 411, 401, 408, 402, 412, 421, 404, 417, 405, 406, 414, 409, 407, 410, 413, 283, 424, 403, 419, 423, 415, 400, 223, 420, 416, 426, 418, 427, 428, 442, 435, 434, 432, 436, 431, 446, 450, 451, 15, 62, 481, 440, 437, 448, 449, 438, 439, 447, 444, 392, 443, 441, 186, 452, 445, 455, 456, 487, 459, 453, 454, 457, 458, 464, 461, 469, 467, 470, 468, 462, 463, 460, 466, 465, 482, 471, 132, 480, 472, 479, 475, 473, 483, 484, 477, 476, 478, 474, 485, 486, 488, 493, 490, 489, 494, 495, 496, 498, 497, 499]],
           names=['Sector', 'Symbol'])

In [176]:
# this has 2 levels
len(multi_fi.index.levels)


Out[176]:
2

In [177]:
# each index level is an index
multi_fi.index.levels[0]


Out[177]:
Index(['Consumer Discretionary', 'Consumer Discretionary ', 'Consumer Staples',
       'Consumer Staples ', 'Energy', 'Financials', 'Health Care',
       'Industrials', 'Industries', 'Information Technology', 'Materials',
       'Telecommunications Services', 'Utilities'],
      dtype='object', name='Sector')

The values of the index at a specific level, for every row, can be retrieved with the .get_level_values() method:


In [178]:
# values of the index level 0
multi_fi.index.get_level_values(0)


Out[178]:
Index(['Industrials', 'Health Care', 'Health Care', 'Information Technology',
       'Financials', 'Health Care', 'Information Technology', 'Utilities',
       'Health Care', 'Financials',
       ...
       'Utilities', 'Information Technology', 'Information Technology',
       'Financials', 'Industrials', 'Information Technology',
       'Consumer Discretionary', 'Health Care', 'Financials', 'Health Care'],
      dtype='object', name='Sector', length=500)

In [179]:
# get all the stocks that are industrials
# note the result drops level 0 of the index
multi_fi.xs('Industrials')


Out[179]:
         Price  BookValue
Symbol                   
MMM     141.14     26.668
ALLE     52.46      0.000
APH      95.71     18.315
AVY      48.20     15.616
BA      132.41     19.870
...        ...        ...
UNP     196.26     46.957
UPS     102.73      6.790
UTX     115.54     35.252
WM       43.37     12.330
XYL      38.42     12.127

[64 rows x 2 columns]

In [181]:
# select rows where level 1 is ALLE
# note that the sector level is dropped from the result
multi_fi.xs('ALLE', level=1)


Out[181]:
             Price  BookValue
Sector                       
Industrials  52.46        0.0

In [182]:
# Industrials, without dropping the level
multi_fi.xs('Industrials', drop_level=False)


Out[182]:
                     Price  BookValue
Sector      Symbol                   
Industrials MMM     141.14     26.668
            ALLE     52.46      0.000
            APH      95.71     18.315
            AVY      48.20     15.616
            BA      132.41     19.870
...                    ...        ...
            UNP     196.26     46.957
            UPS     102.73      6.790
            UTX     115.54     35.252
            WM       43.37     12.330
            XYL      38.42     12.127

[64 rows x 2 columns]

In [183]:
# drill through the levels
multi_fi.xs('Industrials').xs('UPS')


Out[183]:
Price        102.73
BookValue      6.79
Name: UPS, dtype: float64

In [185]:
# drill through using tuples
multi_fi.xs(('Industrials','UPS'))


Out[185]:
Price        102.73
BookValue      6.79
Name: (Industrials, UPS), dtype: float64
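An alternative to .xs() worth knowing: .loc also accepts tuples against a MultiIndex, selecting a single row with a full tuple or all rows in a level with a partial key. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Price': [141.14, 102.73]},
                  index=pd.MultiIndex.from_tuples(
                      [('Industrials', 'MMM'), ('Industrials', 'UPS')],
                      names=['Sector', 'Symbol']))

# a full tuple selects a single row, like .xs with a tuple
row = df.loc[('Industrials', 'UPS')]

# a partial key (first level only) selects every row in that level
industrials = df.loc['Industrials']
```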

Summarized data and descriptive statistics

pandas provides several classes of statistical operations that can be applied to a Series or DataFrame object. These reduction methods, when applied to a Series, result in a single value.

When applied to a DataFrame, an axis can be specified, and the method is then applied to each column or row, resulting in a Series.


In [186]:
# calc the mean of the values in each column
one_mon_hist.mean()


Out[186]:
MSFT     47.493182
AAPL    112.411364
dtype: float64

In [187]:
# calc the mean of the values in each row
one_mon_hist.mean(axis=1)


Out[187]:
0     81.845
1     81.545
2     82.005
3     82.165
4     81.710
       ...  
17    80.075
18    80.935
19    80.680
20    79.770
21    78.415
Length: 22, dtype: float64

In [188]:
# calc the variance of the values in each column
one_mon_hist.var()


Out[188]:
MSFT    0.870632
AAPL    5.706231
dtype: float64

In [189]:
# calc the median of the values in each column
one_mon_hist.median()


Out[189]:
MSFT     47.625
AAPL    112.530
dtype: float64

In [190]:
# min price for both stocks
one_mon_hist[['MSFT','AAPL']].min()


Out[190]:
MSFT     45.16
AAPL    106.75
dtype: float64

In [191]:
# and the max price
one_mon_hist[['MSFT','AAPL']].max()


Out[191]:
MSFT     48.84
AAPL    115.93
dtype: float64

Some pandas statistical methods are referred to as indirect statistics; for example, .idxmin() and .idxmax() return the index labels where the minimum and maximum values exist, respectively.


In [192]:
# location of the min price for both stocks
one_mon_hist[['MSFT','AAPL']].idxmin()


Out[192]:
MSFT    11
AAPL    11
dtype: int64

In [193]:
one_mon_hist[['MSFT','AAPL']].idxmax()


Out[193]:
MSFT    3
AAPL    2
dtype: int64

In [195]:
# find the mode of the Series
s = pd.Series([1,2,3,3,5])
s.mode()


Out[195]:
0    3
dtype: int64

In [196]:
# there can be more than one mode
s = pd.Series([1,2,3,3,5,1])
s.mode()


Out[196]:
0    1
1    3
dtype: int64

Accumulations in pandas are statistical methods that determine a value by successively applying the next value in a Series to the current result.


In [197]:
# calculate a cumulative product
pd.Series([1,2,3,4]).cumprod()


Out[197]:
0     1
1     2
2     6
3    24
dtype: int64

In [198]:
# calculate a cumulative sum
pd.Series([1,2,3,4]).cumsum()


Out[198]:
0     1
1     3
2     6
3    10
dtype: int64

In [199]:
# summary statistics
one_mon_hist.describe()


Out[199]:
            MSFT        AAPL
count  22.000000   22.000000
mean   47.493182  112.411364
std     0.933077    2.388772
min    45.160000  106.750000
25%    46.967500  111.660000
50%    47.625000  112.530000
75%    48.125000  114.087500
max    48.840000  115.930000

Non-numerical data will result in a slightly different set of summary statistics.


In [202]:
# get summary statistics on non-numeric data
s = pd.Series(['a','a','b','c',np.NaN])
s.describe()


Out[202]:
count     4
unique    3
top       a
freq      2
dtype: object

In [201]:
# count of non-NaN values
s.count()


Out[201]:
3

In [203]:
# return an array of the unique values
s.unique()


Out[203]:
array(['a', 'b', 'c', nan], dtype=object)

In [204]:
# number of occurrences of each unique value
s.value_counts()


Out[204]:
a    2
b    1
c    1
dtype: int64
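.value_counts() also accepts a normalize parameter, which reports relative frequencies rather than raw counts. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'c'])

# normalize=True divides each count by the total number of values
freqs = s.value_counts(normalize=True)
```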